1 Introduction to Open Data Science

This course is an introductory Data Science course that deals with data analysis using R.

We use R, RStudio, GitHub, etc.

Link : https://potdarswapnil.github.io/IODS-project


2 Regression and model validation

2.1. Read the students2014 Dataset

Data source: international survey of students enrolled in Introduction to Social Statistics in fall 2014. Data collection timeframe: 3.12.2014 - 10.01.2015. Data creation date: 14.01.2015. Sample size: 183 subjects with 63 variables. The dataset was filtered down to the columns of interest and to respondents with points > 0, giving 166 subjects and 7 variables.
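The filtering step described above can be sketched in base R. This is a hypothetical illustration: the toy data frame and the column names in `keep_columns` are assumed for the example, not taken from the actual 183 x 63 survey.

```r
# Hypothetical miniature of the full survey data (columns assumed for illustration)
full_data <- data.frame(
  gender = c("F", "M", "F"),
  age    = c(53, 55, 49),
  points = c(25, 0, 24),   # one respondent with zero exam points
  extra  = c(1, 2, 3)      # stands in for the many columns that were dropped
)

# Keep only the columns of interest and the respondents with points > 0
keep_columns <- c("gender", "age", "points")
filtered <- full_data[full_data$points > 0, keep_columns]
filtered   # 2 rows, 3 columns
```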

learnings <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = T) 
dim(learnings)
## [1] 166   7
head(learnings)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
str(learnings)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

2.2 Graphical overview and summary of the data

Load the libraries GGally and ggplot2 for plotting.

library(GGally)
## Warning: package 'GGally' was built under R version 3.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
library(progress)
## Warning: package 'progress' was built under R version 3.3.2

Plot the data

plot_data <- ggpairs(learnings, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
print(plot_data)

Summarize

summary(learnings)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Description:

a. The bar chart shows that females are nearly double the number of males.
b. Attitude is markedly higher in males.
c. Deep and surface learning have negative correlations in both males and females.
d. In males, age correlates negatively with attitude, deep, surface and points.
e. In females, age correlates negatively with surface and points.
f. Points correlate most strongly with attitude (cor = 0.437), then stra (cor = 0.146).
g. The age histograms are right-skewed for both sexes, showing a predominantly young population.

2.3. Regression model

regression_model <- lm(points ~ attitude + stra + surf, data = learnings)
summary(regression_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learnings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

In this multivariable model we explain the variable points with the explanatory variables attitude, stra and surf. Based on the summary, points has a significant relationship with attitude, while stra and surf show no significant relationship with points. The multiple R-squared of 0.207 implies that the model explains about 20%, or one fifth, of the variation in the outcome.
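To make the R-squared interpretation concrete, the statistic can be reproduced by hand as one minus the ratio of residual to total variation. This is a sketch on synthetic data, not on the course dataset:

```r
# R-squared by hand: 1 - (unexplained variation / total variation)
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)     # residual (unexplained) sum of squares
ss_tot <- sum((y - mean(y))^2)      # total sum of squares around the mean
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)   # TRUE
```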

2.4 Interpret the model parameters after removing stra and surf

new_regression_model <- lm(points ~ attitude, data = learnings)
summary(new_regression_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learnings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

This univariate model shows that points is significantly related to attitude (multiple R-squared: 0.1906). R-squared = explained variation / total variation, and it always lies between 0% and 100%: 0% indicates that the model explains none of the variability of the response around its mean, while 100% indicates that it explains all of it. Multiple R-squared is therefore used to evaluate how well the model fits the data. Here, an R-squared of 0.19 implies that the model explains only about 19% of the variation in the outcome.

2.5 Diagnostic plots

# draw diagnostic plots using the plot() function. Choose the plots
# 1 (Residuals vs Fitted values), 2 (Normal QQ-plot) and 5 (Residuals vs Leverage)
par(mfrow = c(2, 2))
plot(new_regression_model, which = c(1, 2, 5))

Assumptions of the model:

a. The model describes the variables we are interested in well.
b. Linearity: the target variable is modelled as a linear combination of the model parameters.
c. The errors are normally distributed, uncorrelated and have constant variance.

The Residuals vs Fitted plot examines the variance of the errors: some residuals deviate noticeably from the zero line, suggesting a mild issue with the model. The Normal Q-Q plot shows most points falling close to the line, with some departure in the left tail, so the fit is reasonably close to the normality assumption; the model is reasonably okay. The Residuals vs Leverage plot shows the impact of single observations on the model: a few observations (with standardized residuals around -3) have a high impact on the model, which is not good.


3 Logistic regression

3.1. Students alcohol dataset

Read the alcohol consumption data from the following location here

Citation: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

Changes in the dataset:

  • The variables not used for joining the two datasets have been combined by averaging (including the grade variables)
  • ‘alc_use’ is the average of ‘Dalc’ and ‘Walc’
  • ‘high_use’ is TRUE if ‘alc_use’ is higher than 2 and FALSE otherwise
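The two derived variables can be reconstructed in a few lines of base R. This is a sketch on toy values of the original 1-5 ‘Dalc’ (weekday) and ‘Walc’ (weekend) scales:

```r
# Toy weekday/weekend consumption values on the original 1-5 scale
alc_demo <- data.frame(Dalc = c(1, 2, 3), Walc = c(1, 3, 5))

alc_demo$alc_use  <- (alc_demo$Dalc + alc_demo$Walc) / 2  # average of the two
alc_demo$high_use <- alc_demo$alc_use > 2                 # TRUE above 2
alc_demo$alc_use    # 1.0 2.5 4.0
alc_demo$high_use   # FALSE TRUE TRUE
```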

3.2. Read the file

The data are read using the following command:

##Load required libraries
library(GGally)
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.1
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.3.2
#read the data in the file produced at the end of the assignment
alc<- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",",header=TRUE)


dim(alc)
## [1] 382  35
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
head(alc)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher
## 2     GP   F  17       U     GT3       T    1    1  at_home    other
## 3     GP   F  15       U     LE3       T    1    1  at_home    other
## 4     GP   F  15       U     GT3       T    4    2   health services
## 5     GP   F  16       U     GT3       T    3    3    other    other
## 6     GP   M  16       U     LE3       T    4    3 services    other
##       reason nursery internet guardian traveltime studytime failures
## 1     course     yes       no   mother          2         2        0
## 2     course      no      yes   father          1         2        0
## 3      other     yes      yes   mother          1         2        3
## 4       home     yes      yes   mother          1         3        0
## 5       home     yes       no   father          1         2        0
## 6 reputation     yes      yes   mother          1         2        0
##   schoolsup famsup paid activities higher romantic famrel freetime goout
## 1       yes     no   no         no    yes       no      4        3     4
## 2        no    yes   no         no    yes       no      5        3     3
## 3       yes     no  yes         no    yes       no      4        3     2
## 4        no    yes  yes        yes    yes      yes      3        2     2
## 5        no    yes  yes         no    yes       no      4        3     2
## 6        no    yes  yes        yes    yes       no      5        4     2
##   Dalc Walc health absences G1 G2 G3 alc_use high_use
## 1    1    1      3        6  5  6  6     1.0    FALSE
## 2    1    1      3        4  5  5  6     1.0    FALSE
## 3    2    3      3       10  7  8 10     2.5     TRUE
## 4    1    1      5        2 15 14 15     1.0    FALSE
## 5    1    2      5        4  6 10 10     1.5    FALSE
## 6    1    2      5       10 15 15 15     1.5    FALSE
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

3.3. Choice of explanatory variables that may relate to alcohol consumption

The purpose of this analysis is to study the relationships between high/low alcohol consumption and some of the other variables in the data.

Numerous variables in the data are plausibly related to alcohol consumption. To analyse and hypothesize about different relationships, the following four are studied here:

  1. sex - student’s sex (binary: ‘F’ - female or ‘M’ - male): usually a big differentiating factor.
  2. age - student’s age (numeric: from 15 to 22): age plays a crucial role in analysing alcohol consumption (at what age should we be careful?).
  3. health - current health status (numeric: from 1 - very bad to 5 - very good): plausibly the variable most directly affected by consumption.
  4. absences - number of school absences (numeric: from 0 to 93): how are studies affected by alcohol consumption?

3.4 Graphical representation of the relationships between alcohol consumption and the chosen variables

Box plots

A box plot (or boxplot) is a convenient way of graphically depicting groups of numerical data through their quartiles. The box spans the first to the third quartile, the line inside it marks the median, and the “whiskers” above and below the box extend to the most extreme points within 1.5 times the interquartile range of the box. Points plotted beyond the whiskers are outliers.
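The five statistics drawn by a boxplot can be inspected directly with `boxplot.stats()`. A small sketch with one artificial outlier:

```r
# Values behind a boxplot: whisker ends, quartiles, median, and outliers
set.seed(2)
vals <- c(rnorm(50), 8)   # standard-normal draws plus one artificial outlier at 8
bs <- boxplot.stats(vals)

bs$stats   # lower whisker, Q1, median, Q3, upper whisker
bs$out     # points beyond the whiskers, plotted individually (includes 8)
```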

g1 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + ggtitle("Impact of Alchohol Consumption on grades")
g1 + geom_boxplot() + ylab("grade")

We can immediately see that with low alcohol use both males and females have nearly identical median grades, with males edging a bit higher. With high alcohol use, males are affected more. It is also interesting to note that the median female grade does not change much, although the upper end of the grade range comes down. For males there are also a number of low outliers.

g2 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + ggtitle("Impact of Alchohol Consumption on absences")
g2 + geom_boxplot() + ylab("Absences")

Again males are worse affected, with the number of absences increasing and the median going up. With high use, absences almost double for males, and there is a substantial increase for females as well.

g3 <- ggplot(alc, aes(x = high_use, y = health, col=sex)) + ggtitle("Impact of Alchohol Consumption on Student grades and Health")
g3 + geom_boxplot() + ylab("Health") + scale_y_reverse()

Apparently males’ health does not suffer much with high consumption, but females appear more sensitive to higher alcohol consumption.

g4 <- ggplot(alc, aes(x = high_use, y = age, col=sex)) + ggtitle("Student absences by alcohol consumption and age")
g4 + geom_boxplot() + ylab("Age")

High consumption occurs at a median age of 17 for males and 16.5 for females. Males start early with low consumption, while females start a year later.

While this is not the entire set of variables, these factors are indicative of the impact of alcohol consumption on students.

Bar plots

Bar plots are one of the most commonly used kinds of data visualization. They display numeric values (on the y-axis) for different categories (on the x-axis).

bp <- ggplot(alc, aes(health, fill=high_use)) + geom_bar(position="dodge") + 
  ggtitle("Barplot of health for high and low alcohol usage")
bp

bp2 <- ggplot(alc, aes(sex, fill=high_use)) + geom_bar(position="dodge") + 
  ggtitle("Barplot of sex counts for high and low alcohol usage")
bp2

bp3 <- ggplot(alc, aes(absences, fill=high_use)) + geom_bar(position="dodge") + 
  ggtitle("Barplot of absence counts for high and low alcohol usage")
bp3

bp4 <- ggplot(alc, aes(G3, fill=high_use)) + geom_bar(position="dodge") + 
  ggtitle("Barplot of grades counts for high and low alcohol usage")
bp4

The bar plots confirm the findings of the box plots in a different way.

b3 <- ggplot(alc, aes(x=high_use, y=absences, fill=health)) +
  geom_bar(stat="identity", position=position_dodge()) +
  ggtitle("Student absences by alcohol consumption and health")
# draw the plot
b3

Cross-tabulations (contingency table)

Cross-tabulations are tables in a matrix format that display the (multivariate) frequency distribution of the variables. They provide a basic picture of the interrelation between two variables and can help find interactions between them.
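The same kind of table can also be produced with base R alone, which is a useful cross-check against the descr output. A sketch on toy data:

```r
# Cross-tabulation with base R: counts, then row proportions
sex      <- factor(c("F", "F", "M", "M", "M"))
high_use <- c(FALSE, TRUE, TRUE, TRUE, FALSE)

tab <- table(sex, high_use)
tab                           # joint frequency counts
prop.table(tab, margin = 1)   # share of high/low use within each sex
```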

library(descr)
## Warning: package 'descr' was built under R version 3.3.2
CrossTable(alc$health,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ===================================
##               alc$high_use
## alc$health    FALSE    TRUE   Total
## -----------------------------------
## 1                35      11      46
##               0.190   0.459        
##               0.761   0.239   0.120
##               0.130   0.098        
##               0.092   0.029        
## -----------------------------------
## 2                29      14      43
##               0.064   0.154        
##               0.674   0.326   0.113
##               0.107   0.125        
##               0.076   0.037        
## -----------------------------------
## 3                63      20      83
##               0.320   0.772        
##               0.759   0.241   0.217
##               0.233   0.179        
##               0.165   0.052        
## -----------------------------------
## 4                47      17      64
##               0.069   0.166        
##               0.734   0.266   0.168
##               0.174   0.152        
##               0.123   0.045        
## -----------------------------------
## 5                96      50     146
##               0.501   1.209        
##               0.658   0.342   0.382
##               0.356   0.446        
##               0.251   0.131        
## -----------------------------------
## Total           270     112     382
##               0.707   0.293        
## ===================================
CrossTable(alc$sex,alc$high_use, prop.t=TRUE, prop.r=TRUE, prop.c=TRUE)
##    Cell Contents 
## |-------------------------|
## |                       N | 
## | Chi-square contribution | 
## |           N / Row Total | 
## |           N / Col Total | 
## |         N / Table Total | 
## |-------------------------|
## 
## ================================
##            alc$high_use
## alc$sex    FALSE    TRUE   Total
## --------------------------------
## F            157      41     198
##            2.078   5.009        
##            0.793   0.207   0.518
##            0.581   0.366        
##            0.411   0.107        
## --------------------------------
## M            113      71     184
##            2.236   5.390        
##            0.614   0.386   0.482
##            0.419   0.634        
##            0.296   0.186        
## --------------------------------
## Total        270     112     382
##            0.707   0.293        
## ================================

3.5 Logistic regression

Model the variable high_use as a function of the chosen explanatory variables using logistic regression.

Logistic regression is a method for fitting a regression curve, y = f(x), when y consists of proportions, probabilities, or binary coded (0 = failure, 1 = success) data. When the response is a binary (dichotomous) variable and x is numeric, logistic regression fits a logistic curve to the relationship between x and y.
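The “logistic curve” mentioned above is the function p = 1 / (1 + exp(-eta)) that maps the linear predictor eta onto a probability; in R it is available as `plogis()`. A quick sketch:

```r
# The logistic link maps any real-valued linear predictor to (0, 1)
eta <- c(-4, -2, 0, 2, 4)
p   <- plogis(eta)          # equivalent to 1 / (1 + exp(-eta))
round(p, 3)                 # 0.018 0.119 0.500 0.881 0.982
```

Correspondingly, the coefficients returned by `glm(..., family = "binomial")` are on the log-odds scale, and `exp(coef(m))` converts them into odds ratios.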

We treat the students’ health as a factor (categorical) variable, since it takes the values 1-5. The model and its results (summary, coefficients and confidence intervals) are obtained with the following functions:

alc$health <- factor(alc$health)
m <- glm(high_use ~ G3 + absences+sex+health, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ G3 + absences + sex + health, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8620  -0.8454  -0.6204   1.0572   2.0505  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.68344    0.49385  -3.409 0.000652 ***
## G3          -0.03396    0.02591  -1.311 0.189930    
## absences     0.07652    0.01854   4.128 3.66e-05 ***
## sexM         1.01459    0.24875   4.079 4.53e-05 ***
## health2      0.36768    0.50514   0.728 0.466686    
## health3     -0.11382    0.46137  -0.247 0.805146    
## health4      0.19310    0.47462   0.407 0.684118    
## health5      0.32903    0.41962   0.784 0.432968    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 421.53  on 374  degrees of freedom
## AIC: 437.53
## 
## Number of Fisher Scoring iterations: 4
coef(m)
## (Intercept)          G3    absences        sexM     health2     health3 
## -1.68343883 -0.03395765  0.07651849  1.01458713  0.36768133 -0.11381629 
##     health4     health5 
##  0.19310111  0.32903488
confint(m)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -2.69205669 -0.74729058
## G3          -0.08465022  0.01723061
## absences     0.04227944  0.11462062
## sexM         0.53341598  1.51054655
## health2     -0.61775150  1.37599976
## health3     -1.00690619  0.81490637
## health4     -0.72685195  1.14700931
## health5     -0.47001929  1.18784390
confint.default(m)
##                   2.5 %     97.5 %
## (Intercept) -2.65136084 -0.7155168
## G3          -0.08473311  0.0168178
## absences     0.04018536  0.1128516
## sexM         0.52704638  1.5021279
## health2     -0.62237191  1.3577346
## health3     -1.01808292  0.7904503
## health4     -0.73714564  1.1233479
## health5     -0.49340847  1.1514782

4 Clustering and classification

4.1. Boston dataset from MASS package

# Load the required (usual) libraries
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(reshape2)
library(ggplot2)
library(GGally)
library(dplyr)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.3.2
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.3.2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Warning: package 'tibble' was built under R version 3.3.1
## Warning: package 'tidyr' was built under R version 3.3.1
## Warning: package 'readr' was built under R version 3.3.2
## Warning: package 'purrr' was built under R version 3.3.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## select(): dplyr, MASS
#load the Boston Data
data("Boston")

4.2. Boston Dataset

Load the Boston data from the MASS package in R. The data are described in the R documentation here. The dataset concerns real estate in the Boston suburban area.

  • 506 observations of 14 different variables.
  • It contains the following columns:
      crim: per capita crime rate by town.
      zn: proportion of residential land zoned for lots over 25,000 sq.ft.
      indus: proportion of non-retail business acres per town.
      chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
      nox: nitrogen oxides concentration (parts per 10 million).
      rm: average number of rooms per dwelling.
      age: proportion of owner-occupied units built prior to 1940.
      dis: weighted mean of distances to five Boston employment centres.
      rad: index of accessibility to radial highways.
      tax: full-value property-tax rate per $10,000.
      ptratio: pupil-teacher ratio by town.
      black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
      lstat: lower status of the population (percent).
      medv: median value of owner-occupied homes in $1000s.

We will focus on the Crime rate and look at the data available.

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

4.3. Boston Dataset : Graphical Overview

pairs(Boston)

ggpairs(Boston, diag=list(continuous="density", discrete="bar"), axisLabels="show")


  • Numerical values of the correlation matrix:
cor_matrix<-cor(Boston) %>% round(digits =2)
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
  • Graphical representation of the correlation matrix:
library(corrplot)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

In this upper-triangular correlation plot, the size and colour of each circle represent the correlation value. We can make the following observations:

  • Positive correlation of the crime rate with rad (accessibility to radial highways) and the property tax rate.
  • Negative correlation of the crime rate with the distance to employment centres.
  • Negative correlation of the crime rate with the number of rooms per dwelling.
  • rm and age appear approximately normally distributed (from the pairs plot).
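These visual impressions can be checked numerically by sorting the crim row of the correlation matrix (MASS is loaded here to keep the sketch self-contained):

```r
library(MASS)  # for the Boston data

# Correlations of the crime rate with the other 13 variables, strongest first
crim_cor <- sort(cor(Boston)["crim", -1], decreasing = TRUE)
round(head(crim_cor, 3), 2)   # rad and tax lead the positive correlations
```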

4.4 Standardization of data

In scaling, we subtract the column mean from each value in the column and divide the difference by the column's standard deviation:

scaled(x) = (x - mean(x)) / sd(x)

We can use the scale() function to scale the Boston dataset:

boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

Scaling makes variables measured on different ranges comparable. We can see that all the columns are now on the same scale and the mean of every column is zero.
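
These properties of scale() can be verified directly: after centering and scaling, every column has mean 0 and standard deviation 1. A quick sketch:

```r
# After scale(), each column should have mean 0 and standard deviation 1.
library(MASS)
data("Boston")

boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
colMeans(boston_scaled)         # all zero (up to numerical precision)
apply(boston_scaled, 2, sd)     # all exactly 1
```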

4.4. Linear Discriminant Analysis

Linear discriminant analysis is a classification (and dimension reduction) method. It finds the linear combination of the variables that best separates the classes of the target variable. The target can be a binary or multiclass variable.

  • We first convert the crime rate variable into a categorical variable with the values “low”, “med_low”, “med_high” and “high”. We define the break points with the quantile() function, applied to the scaled continuous variable crim.
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim  
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
bins <- quantile(scaled_crim) 
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(scaled_crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
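
Because the breaks come from quantile(), cut() produces four classes of (nearly) equal size. A small self-contained illustration with simulated data:

```r
# Quantile breaks split a continuous variable into four equal-sized classes.
set.seed(1)
x <- rnorm(1000)
bins <- quantile(x)
classes <- cut(x, breaks = bins,
               labels = c("low", "med_low", "med_high", "high"),
               include.lowest = TRUE)   # include.lowest keeps the minimum value
table(classes)   # 250 observations in each class
```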

We look at the table of the categorical variable to see how many observations fall into each class:

table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127

Next we drop the old crime rate variable crim from the dataset and add the new categorical variable crime to the scaled data:

boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

We divide the data into a training set and a test set, with the training set containing 80% of the rows. We choose 80% of the rows randomly for the training set and use the rest for testing.

n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
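
The split can be made reproducible by seeding the random number generator first. A sketch (the seed value is arbitrary):

```r
# Reproducible 80/20 train/test split of the Boston data.
library(MASS)
data("Boston")

set.seed(2017)                     # arbitrary seed, for reproducibility
n   <- nrow(Boston)                # 506 rows
ind <- sample(n, size = n * 0.8)   # 404 randomly chosen row indices
train <- Boston[ind, ]
test  <- Boston[-ind, ]
c(train = nrow(train), test = nrow(test))
```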

4.5 LDA modeling using training data

We fit a linear discriminant model on the training data:

lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2648515 0.2549505 0.2500000 0.2301980 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       0.9812016 -0.9122651 -0.08835242 -0.8756875  0.4595651
## med_low  -0.1302776 -0.3079864 -0.08120770 -0.5626616 -0.1482853
## med_high -0.3751653  0.1607534  0.15646403  0.3687568  0.1144649
## high     -0.4872402  1.0173143 -0.01832260  1.0846959 -0.3910877
##                 age        dis        rad        tax     ptratio
## low      -0.8692818  0.8464635 -0.6952911 -0.7382037 -0.46467723
## med_low  -0.3601914  0.3325631 -0.5425547 -0.4431870 -0.02654817
## med_high  0.3895256 -0.3508617 -0.4337909 -0.3439588 -0.31788646
## high      0.8145383 -0.8575186  1.6361396  1.5126335  0.77846144
##                black       lstat        medv
## low       0.37446483 -0.75753518  0.55207313
## med_low   0.32084076 -0.16845761 -0.01950704
## med_high  0.05660614 -0.04304958  0.19946718
## high     -0.67891262  0.83156084 -0.62718549
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.080110483  0.67519370 -0.94037403
## indus    0.045664772 -0.30287097  0.05783255
## chas    -0.010232758 -0.01970779  0.04486453
## nox      0.439862400 -0.73959685 -1.28296353
## rm       0.005739707 -0.03660303 -0.20821213
## age      0.195822889 -0.38227641 -0.16618838
## dis     -0.022965138 -0.33716687  0.05876677
## rad      3.312939403  0.82131923 -0.31500610
## tax      0.023661473  0.18340740  0.89458982
## ptratio  0.131660649  0.03364211 -0.22473492
## black   -0.085891056  0.03503441  0.15081973
## lstat    0.210468747 -0.18007020  0.35064662
## medv     0.092125032 -0.32562934 -0.17211315
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9499 0.0360 0.0141
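
The proportion of trace reported above can be recomputed from the singular values stored in the fitted object. As a self-contained illustration, here is the same computation on the built-in iris data (a stand-in, since the train set above depends on a random split):

```r
# Proportion of trace = squared singular values, normalized to sum to 1.
library(MASS)

fit  <- lda(Species ~ ., data = iris)
prop <- fit$svd^2 / sum(fit$svd^2)
round(prop, 4)   # the same quantity that lda's print method reports
```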

Next we visualize the results with a biplot. The dimen argument of the plot() function determines how many discriminants are used.

lda.fit <- lda(crime ~ ., data = train)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)

plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

4.6 LDA for prediction: use of test data.

We use the LDA model to predict the classes of the test data. To assess model performance we cross tabulate the correct and the predicted classes.

lda.pred <- predict(lda.fit, newdata = test)
table(correct = test$crime, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       13       7        0    0
##   med_low    7      13        3    0
##   med_high   0       4       19    2
##   high       0       0        0   34

The table() cross tabulation shows how many observations were classified correctly and how many were misclassified.
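
From such a cross tabulation, the overall accuracy is the sum of the diagonal divided by the total. A minimal sketch with toy vectors (hypothetical data, just to show the computation):

```r
# Accuracy from a confusion table: correct predictions sit on the diagonal.
correct   <- factor(c("low", "low", "high", "high"))
predicted <- factor(c("low", "high", "high", "high"))
tab <- table(correct = correct, predicted = predicted)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy   # 0.75: three of the four toy observations are classified correctly
```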

4.7 Distance measures and clustering. K-means algorithm

  • In clustering the number of classes is unknown, and the data are grouped based on their similarity.
  • We standardize the Boston data again.
  • The Euclidean distance in the dist() function creates a distance matrix consisting of the pairwise distances between the observations.
dist_eu <- dist(boston_scaled, method = "euclidean")
## Warning in dist(boston_scaled, method = "euclidean"): NAs introduced by
## coercion
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1394  3.5270  4.9080  4.9390  6.2420 13.0000

(The warning about NAs above arises because boston_scaled still contains the categorical crime column, which dist() coerces to numeric.)

K-means, kmeans(), is an unsupervised method that assigns observations to groups or clusters based on the similarity of the objects. It takes the desired number of clusters as an argument:

km <- kmeans(dist_eu, centers = 4)

One way to determine the number of clusters is to look at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. A good number of clusters is found where the WCSS drops sharply.

The total within sum of squares twcss is computed below. K-means assigns the initial cluster centers randomly, so we call set.seed() first to make the result reproducible; otherwise reruns might produce slightly different plots.

set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
plot(1:k_max, twcss, type='b')

km <-kmeans(dist_eu, centers = 2)
pairs(Boston, col = km$cluster)

The results show that twcss drops sharply at k = 2. On the last plot we therefore run kmeans with 2 clusters and colour the pairs plot by cluster, which is easier to interpret visually.
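
Note that above kmeans() is applied to the distance matrix; kmeans() computes Euclidean distances internally, so it can equally (and more conventionally) be applied to the scaled data itself. A hedged sketch:

```r
# kmeans() on the scaled data directly, rather than on a precomputed
# distance matrix; kmeans() measures Euclidean distance internally.
library(MASS)
data("Boston")

boston_scaled <- scale(Boston)
set.seed(123)                     # k-means starts from random centers
km <- kmeans(boston_scaled, centers = 2)
table(km$cluster)                 # cluster sizes
```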

4.Bonus.1

set.seed(123)
data("Boston")
boston_scaled2 <- as.data.frame(scale(Boston, center = TRUE, scale = TRUE))
dist_eu2 <- dist(boston_scaled2, method = "euclidean")
km2 <- kmeans(dist_eu2, centers = 5)
boston_scaled2$cluster <- km2$cluster
lda.fit2 <- lda(cluster ~ ., data =boston_scaled2)
plot(lda.fit2, col=as.numeric(boston_scaled2$cluster), dimen=2)
lda.arrows(lda.fit2, myscale = 3, col = "#666666")

4.Bonus.2

library(plotly)
## Warning: package 'plotly' was built under R version 3.3.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
train$cluster <- boston_scaled2$cluster[match(rownames(train), rownames(boston_scaled2))]
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=as.factor(train$cluster))

5 Dimensionality Reduction Techniques

5.1. Human Data from United Nations

#load libraries
library(GGally)
library(corrplot)
library(tidyr)
library(dplyr)
human <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", header = TRUE)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...

Data frame with 155 rows and 8 columns. The human dataset contains various indicators of the well-being of countries. The summary shows there are altogether 155 observations (i.e. countries), and these are the variables:

  • Edu2.FM: the ratio of females to males in secondary education
  • Labo.FM: the ratio of females to males in the labour force
  • Edu.Exp: expected number of years spent in schooling
  • Life.Exp: life expectancy in years
  • GNI: gross national income
  • Mat.Mor: the relative number of mothers who die in childbirth
  • Ado.Birth: the rate of teenage pregnancies leading to childbirth
  • Parli.F: the percentage of female parliamentarians

5.2. Data Exploration

#p <- ggpairs(human, mapping = aes(col="blue", alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot #cor(human) %>% round(digits=2) %>% corrplot.mixed(tl.cex = 0.7,number.cex=0.7,order=“hclust”,addrect=2)

cor_matrix<-cor(human)%>% round(digits =2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

  • Edu.Exp is approximately normally distributed
  • Edu.Exp is negatively correlated with Ado.Birth
  • Life.Exp is highly correlated with GNI and Edu.Exp
  • Life.Exp is negatively correlated with Mat.Mor
  • Parli.F is only weakly related to the other variables, but it is correlated with the female-to-male labour ratio (Labo.FM)

5.3. Principal Component Analysis on the non-standardized human data

pca_human <- prcomp(human)
biplot(pca_human, choices = 1:2, cex = c(0.5,0.5), col = c("orange", "purple"), main="Biplot of the first two principal components for the unscaled data")
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

In the biplot of the unscaled data only the GNI arrow is visible: GNI has by far the largest variance, so it dominates the first principal component and the contributions of the other variables are negligible. Qatar stands out as an outlier.
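
The dominance of GNI is easy to reproduce: without scaling, prcomp() maximizes raw variance, so a column measured in large units captures the first component almost entirely. A self-contained sketch with simulated data (the variable names are made up):

```r
# PCA on unscaled data: the high-variance column dominates PC1.
set.seed(42)
toy <- data.frame(small = rnorm(100),            # sd ~ 1
                  big   = rnorm(100, sd = 1000)) # sd ~ 1000, like GNI
p <- prcomp(toy)
round(p$rotation, 3)                      # PC1 loads almost entirely on "big"
summary(p)$importance["Proportion of Variance", ]
```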

5.4 Principal Component Analysis on the Standardized human data

Let's scale the data for better interpretation:

human_scale <- scale(human)
summary(human_scale)
##     Edu2.FM           Labo.FM           Edu.Exp           Life.Exp      
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50

Check the PCA after scaling:

pca_human <- prcomp(human_scale)
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("orange", "purple"), main="Biplot of the first two principal components for the scaled data")

The countries are the individual observations in the biplot. Since the data are now standardized, all variables contribute on a comparable scale, and the countries separate according to their correlations with the principal components.

5.5. PCA Observations

Observations:

  • PC1 correlates strongly with Mat.Mor
  • PC2 correlates strongly with Parli.F
  • Mat.Mor and Ado.Birth are highly correlated, as are Edu.Exp, Edu2.FM, Life.Exp and GNI
  • Life.Exp, GNI, Edu.Exp and Edu2.FM point in the opposite direction to Mat.Mor and Ado.Birth, which indicates the importance of education and life expectancy
  • Female representation in parliament seems to go together with high GNI, high life expectancy and a greater female-to-male labour ratio

5.6. Multiple Correspondence Analysis (MCA)

For categorical variables, dimensionality reduction can be carried out with correspondence analysis (CA, two variables) or multiple correspondence analysis (MCA, more than two variables). MCA is a generalization of PCA and an extension of CA. The analysis uses the cross-tabulations between all variables in the data set and presents this information in a clear graphical form. MCA can also be used, for example, as a pre-processing step for clustering.

5.6.1 The data

Here we use the tea data from the FactoMineR package.

library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 3.3.2
library(tidyr)
library(ggplot2)
library(dplyr)
data("tea")
keep_columns <- c("Tea","price","How", "how", "sugar", "where","lunch")
tea_time <- dplyr::select(tea, one_of(keep_columns))
summary(tea_time)
##         Tea                  price        How     
##  black    : 74   p_branded      : 95   alone:195  
##  Earl Grey:193   p_cheap        :  7   lemon: 33  
##  green    : 33   p_private label: 21   milk : 63  
##                  p_unknown      : 12   other:  9  
##                  p_upscale      : 53              
##                  p_variable     :112              
##                  how           sugar                      where    
##  tea bag           :170   No.sugar:155   chain store         :192  
##  tea bag+unpackaged: 94   sugar   :145   chain store+tea shop: 78  
##  unpackaged        : 36                  tea shop            : 30  
##                                                                    
##                                                                    
##                                                                    
##        lunch    
##  lunch    : 44  
##  Not.lunch:256  
##                 
##                 
##                 
## 
str(tea_time)
## 'data.frame':    300 obs. of  7 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ price: Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...

Selected variables: Tea, price, How, how, sugar, where and lunch.

Graphically the data are presented as barplots using the ggplot() function.

gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8)) 
## Warning: attributes are not identical across measure variables; they will
## be dropped

  • Very few people drink cheap tea.
  • Most people drink tea alone, use tea bags, buy from chain stores, and the preferred variety is Earl Grey.

Next we run an MCA on the chosen tea data.

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.304   0.249   0.188   0.180   0.156   0.152
## % of var.             13.293  10.896   8.242   7.858   6.846   6.632
## Cumulative % of var.  13.293  24.189  32.431  40.289  47.136  53.767
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.148   0.140   0.130   0.124   0.118   0.101
## % of var.              6.463   6.128   5.691   5.441   5.151   4.428
## Cumulative % of var.  60.230  66.358  72.049  77.490  82.641  87.069
##                       Dim.13  Dim.14  Dim.15  Dim.16
## Variance               0.095   0.080   0.073   0.048
## % of var.              4.169   3.479   3.193   2.090
## Cumulative % of var.  91.238  94.717  97.910 100.000
## 
## Individuals (the 10 first)
##                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1               | -0.466  0.238  0.050 |  0.477  0.305  0.053 | -0.312
## 2               | -0.192  0.040  0.024 | -0.024  0.001  0.000 | -0.633
## 3               | -0.293  0.094  0.116 | -0.032  0.001  0.001 | -0.142
## 4               | -0.410  0.185  0.221 | -0.003  0.000  0.000 |  0.228
## 5               | -0.293  0.094  0.116 | -0.032  0.001  0.001 | -0.142
## 6               | -0.486  0.260  0.099 |  0.348  0.162  0.050 | -0.096
## 7               | -0.293  0.094  0.116 | -0.032  0.001  0.001 | -0.142
## 8               | -0.192  0.040  0.024 | -0.024  0.001  0.000 | -0.633
## 9               |  0.569  0.356  0.150 | -0.552  0.407  0.141 | -0.121
## 10              |  0.812  0.723  0.321 | -0.401  0.216  0.078 | -0.636
##                    ctr   cos2  
## 1                0.172  0.023 |
## 2                0.708  0.256 |
## 3                0.036  0.027 |
## 4                0.092  0.068 |
## 5                0.036  0.027 |
## 6                0.016  0.004 |
## 7                0.036  0.027 |
## 8                0.708  0.256 |
## 9                0.026  0.007 |
## 10               0.715  0.196 |
## 
## Categories (the 10 first)
##                     Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black           |   0.442   2.263   0.064   4.370 |   0.094   0.125
## Earl Grey       |  -0.222   1.492   0.089  -5.158 |  -0.182   1.225
## green           |   0.309   0.492   0.012   1.876 |   0.855   4.614
## p_branded       |  -0.602   5.401   0.168  -7.090 |   0.410   3.056
## p_cheap         |  -0.440   0.212   0.005  -1.175 |   0.377   0.190
## p_private label |  -0.717   1.692   0.039  -3.401 |   0.560   1.260
## p_unknown       |  -0.849   1.357   0.030  -2.998 |   0.635   0.925
## p_upscale       |   1.555  20.082   0.519  12.454 |   0.472   2.253
## p_variable      |   0.028   0.014   0.000   0.373 |  -0.768  12.621
## alone           |  -0.017   0.008   0.001  -0.392 |   0.146   0.798
##                    cos2  v.test     Dim.3     ctr    cos2  v.test  
## black             0.003   0.928 |  -1.103  22.739   0.398 -10.909 |
## Earl Grey         0.060  -4.231 |   0.424   8.782   0.325   9.853 |
## green             0.090   5.198 |  -0.009   0.001   0.000  -0.055 |
## p_branded         0.078   4.828 |  -0.073   0.127   0.002  -0.856 |
## p_cheap           0.003   1.008 |  -0.323   0.185   0.002  -0.863 |
## p_private label   0.024   2.658 |   0.206   0.226   0.003   0.978 |
## p_unknown         0.017   2.241 |  -0.047   0.007   0.000  -0.165 |
## p_upscale         0.048   3.777 |  -0.038   0.019   0.000  -0.305 |
## p_variable        0.351 -10.246 |   0.066   0.124   0.003   0.884 |
## alone             0.040   3.447 |  -0.077   0.289   0.011  -1.804 |
## 
## Categorical variables (eta2)
##                   Dim.1 Dim.2 Dim.3  
## Tea             | 0.090 0.104 0.416 |
## price           | 0.612 0.354 0.009 |
## How             | 0.055 0.087 0.392 |
## how             | 0.619 0.492 0.013 |
## sugar           | 0.051 0.003 0.316 |
## where           | 0.700 0.604 0.059 |
## lunch           | 0.000 0.099 0.114 |
plot(mca, invisible=c("ind"), habillage = "quali")

  • Tea shops are associated with unpackaged tea and upscale prices (p_upscale)
  • People usually drink Earl Grey and are more likely to take it with sugar and milk
  • Green tea stands apart from all the other categories